(57) A mechanism is provided for the automatic generation of
virus fingerprint data for use in detecting computer viruses
and virus removal data for use in removing computer viruses
from infected files. The fingerprint generation technique
serves to identify the infected virus carrying portions of a
computer file and then search within those portions for
matching blocks of bytes in excess of a certain size that are
consistently located at a predetermined position within the
infected computer file such that they may be used to reliably
detect that computer virus when it is infecting different host
computer files. The removal data generation mechanism serves
to search the infected computer file against a clean version
of that computer file to identify matching blocks. Critical
data missing within the infected computer file may be found
within the virus carrying portions by the application of
various decryption techniques. Cutting points to remove the
virus carrying portions are identified. The fingerprint data
and the removal data are tested on pairs of clean and infected
computer files to verify that they operate correctly.
[0001] This invention relates to the field of data
processing systems. More particularly, this invention relates
to the field of the building of detection and cleaning
mechanisms for use against computer viruses.
[0002] It is known to provide systems that detect and
remove known computer viruses from computer files infected
with those known computer viruses. When a new computer virus
is released by a virus author, it generally rapidly comes to
the notice of anti-virus system providers. These anti-virus
system providers must then analyse the new computer virus to
identify distinct characteristics of computer files infected
with that computer virus such that they may provide a
detection mechanism to their customers. In addition, the
anti-virus providers must also analyse the computer files
infected with the new computer virus to determine a way of
removing the new computer virus from the computer file in
question and recovering as much data as possible.
[0003] The steps of identifying detection and cleaning
techniques are typically performed by an experienced computer
programmer in the field of anti-virus systems who is able to
analyse an infected computer file to identify the virus
portions and develop a detection and cleaning mechanism for
the new virus. However, the number of new viruses and variants
of existing viruses is increasing such that it is difficult
for anti-virus system providers to keep pace with the release
of new viruses. Furthermore, steps that can increase the speed
with which virus detection and virus cleaning mechanisms can
be developed and released to customers are strongly
advantageous since a new virus will typically have then spread
to a lesser extent and have caused less damage than if the
release of the detection and cleaning mechanisms took longer.
[0004] Viewed from one aspect the present invention provides a
computer program product including a computer program for
controlling a computer to automatically generate virus
fingerprint data for use in detecting computer files infected
with a computer virus, said computer program comprising:

comparison logic operable for each of a plurality of
computer files to compare a computer file infected with said
computer virus with a version of said computer file not
infected with said computer virus to identify virus carrying
portions of said computer file infected with said computer
virus that are not present within said version of said
computer file not infected with said computer virus;

block searching logic operable to search within said
virus carrying portions of said plurality of computer
programs infected with said computer virus for a matching
block of code exceeding a threshold block size;

pointer searching logic operable to search within said
plurality of computer files infected with said computer
virus for a common pointer definition identifying a position
of said matching block of code within each of said plurality
of computer files infected with said computer virus; and

fingerprint generating logic operable to generate virus
fingerprint data representing said matching block of code
and said common pointer definition such that said
fingerprint data may be applied to a computer file suspected
of being infected with said computer virus to determine if
said common block of code is present at a position indicated
by said common pointer definition thereby indicating
infection by said computer virus.

[0005] The invention provides an automated mechanism for
generating virus fingerprint data that can be used in
detecting a computer virus. This automated mechanism can be
applied to a new virus candidate to rapidly generate virus
fingerprint data. Whilst such an automated technique may not
work in all circumstances, it will work in sufficient
circumstances that only the most difficult new computer
viruses require special handling. Thus, the virus detection
mechanism can be developed and released more rapidly.
[0006] Whilst searching for matching blocks of code 
could take place starting from a variety of positions, it is 
preferable to search following the execution path starting
from the entry point for the infected computer file. In order
to operate, a computer virus must at some point fall upon the
execution path, and accordingly following this execution path
is a good way of consistently identifying the computer virus
even though it may be positioned at a variety of different
locations within different infected files.

[0007] Some computer viruses encrypt themselves 
within the computer file which they infect. This makes 
the computer viruses more difficult to detect since in 
their encrypted form they differ between different infect- 
ed computer files. However, the computer virus at some 
point usually decrypts itself into a form that is consistent 
between different infected computer files. Accordingly, 
preferred embodiments of the invention operate such that if
an adequate detection mechanism cannot be derived directly
from the infected computer files, then execution of the
computer files infected with the virus is emulated until
execution of a portion of code written during the emulation
is reached.
[0008] A portion of code that has been written during
emulation and is then executed is likely to be a portion of
decrypted, or partially decrypted, virus code, which increases
the probability of consistent matches being achieved between
different infected computer files emulated in this way.

[0009] If emulation is conducted more than a threshold
number of times and a detection mechanism has still not been
found, then preferred embodiments recognise this and terminate
the attempt to automatically generate the fingerprint data.
Such cases may then be passed to an expert human operator for
further investigation.
[0010] Viewed from another aspect the invention provides a
computer program product including a computer program for
controlling a computer to automatically generate virus
removal data defining how a computer virus may be removed from
a computer file infected with said computer virus, said
computer program comprising:

virus portion identifying logic operable to identify as
virus carrying portions one or more portions of said
computer file infected with said computer virus not matching
blocks of code within said version of said computer file not
infected with said computer virus;

block searching logic operable to search within said
virus carrying portions for blocks of code matching a
version of said computer file not infected with said
computer virus;

pointer identifying logic operable to identify one or
more location pointers to said matching blocks of code
within said computer file infected with said computer virus;

testing logic operable to test said one or more location
pointers to check that they identify matching blocks of code
within all of a set of other computer files infected with
said computer virus; and

removal data generating logic operable to generate said
virus removal data representing said one or more location
pointers, block sizes and original positions such that virus
removal data may be applied to a computer file infected with
said computer virus to remove said computer virus.

[0011] A complementary aspect of the invention is the
provision of a mechanism for automatically generating virus
removal data that may be used to remove a computer virus from
a computer file.

[0012] It will be appreciated that in some circumstances
the ability to clean a computer file may be limited and
it may not be possible to recover all of the data from the 
original uninfected computer file. Preferred embodi- 
ments of the invention are such that if a critical portion 
of the uninfected computer file is not found within the 
infected computer file, then decryption techniques are 
applied to the infected computer file to see if it may be 
recovered. Applying decryption techniques for critical 
portions only is a good balance between the time taken 
to attempt to decrypt the infected computer file weighed 
against the need to recover the particular critical por- 
tions of the uninfected computer file. 
[0013] As previously mentioned, some computer vi- 
ruses disguise themselves and may also disguise por- 
tions of the original computer file in a manner such that 
if emulation of execution of the infected computer file is 
undertaken, then portions of the original computer file 
will be decrypted by the virus itself. Accordingly, pre- 
ferred embodiments of the invention emulate execution 
to attempt to find further matching blocks of code. 
[0014] The emulation is carried out only up to a threshold
number of times before the search for an automatically
generated removal technique is terminated and the task passed
to a human operator.
[0015] Further aspects of the invention also provide a
method and an apparatus for automatically generating virus
fingerprint data and virus removal data.
[0016] Embodiments of the invention will now be de- 
scribed, by way of example only, with reference to the 
accompanying drawings in which: 

Figure 1 schematically illustrates a clean version of
a computer file and an infected version of a computer
file;

Figure 2 schematically illustrates three different
computer files infected with the same computer virus;

Figure 3 schematically illustrates a computer virus 
that seeks to hide itself by encryption and decrypts 
itself when it runs; 

Figure 4 is a flow diagram schematically illustrating 
the processing performed when automatically iden- 
tifying virus definition data; 

Figure 5 is a flow diagram schematically illustrating 
the processing performed when automatically gen- 
erating virus removal data; and 

Figure 6 is a diagram schematically illustrating a 
general purpose computer of the type that may be 
used to implement the above described techniques. 



[0017] Figure 1 schematically illustrates a clean version
of a computer file 2 and an infected version of the same
computer file 4, which has been infected by a computer virus.
It will be appreciated that this is only one example of how a
computer virus may operate. In this example, the computer
virus has divided the original clean computer file 2 into
three portions and has inserted virus code in the two gaps
between these three portions. The original computer program
entry point was at the beginning of the clean computer program
2. In the infected computer program 4, the virus has modified
the entry point to start execution at a small area of virus
entry point code 6. When the infected computer program 4 is
executed, the virus entry code 6 is first run and then jumps
execution to the virus main body 8. The virus main body 8 may
be encrypted, in which case part of the job of the virus entry
code 6 would be to decrypt the virus main body 8 and then
start executing the decrypted code.

[0018] It will be appreciated that a different virus may
insert itself at a single point or at more than two points. A
virus may serve to vary its insertion position but will
typically have some fixed relationship to a reference point
within the computer file, e.g. the virus might always be found
starting a predetermined distance from the end of the computer
file, from the beginning of the computer file, from an entry
point of the computer file, a certain proportion of the way
through a computer file, etc. As well as having common blocks
of code, a particular computer virus will also typically have
a common way in which it locates itself within the host file
and this needs to be identified in order to generate
appropriate pointers to the virus that can be used in
detection and cleaning mechanisms.
[0019] Figure 2 schematically illustrates three different
host computer files each infected with the same computer
virus. In these different host computer files, the main virus
body is consistently found at a fixed offset from the end of
the computer file in question. The virus entry code in these
examples is always at the start of the computer file. This
contrasts with the Figure 1 embodiment in which the infected
version of the computer file moved the entry point to part way
through the computer file. The virus entry code in the three
examples of Figure 2 is identical, as is represented by
"*1=*2=*3".
[0020] In the example illustrated in Figure 2, the virus
main bodies differ between the computer files but a certain
portion indicated by "#1=#2=#3" is matching between the three
different infected files and may be used to identify the
virus.

[0021] Figure 3 illustrates the situation in which an
infected computer file has an encrypted main virus body 10.
Virus entry and decryption code 12 when activated serves to
decrypt the main body of the virus and then execute that main
body. Such an encryption technique is an attempt by the virus
author to make the computer virus more difficult to detect
since the same computer virus will not share common main body
encrypted code between different infected files.
[0022] As is illustrated in Figure 3, the virus entry and
decryption code 12 serves to decrypt the encrypted main body
10 and write a new decrypted main body 14. In this example,
the decrypted main body 14 is illustrated as being appended to
the end of the computer file. It will be appreciated that the
decrypted main body 14 could be written into other locations,
such as over the encrypted main body 10, at a new position
within the infected computer file, or even in a different
area, such as into memory. However, at some point the virus
entry and decryption code 12 will jump to start execution of
the decrypted main body code 14. It is at this point that an
attempt to match common decrypted main body code 14 between
different infected computer files may be made. Execution of
the infected computer file may be emulated (using known
techniques) to identify this point and to generate the
corresponding decrypted main body code 14.

[0023] Figure 4 schematically illustrates the processing
performed to automatically generate virus fingerprint data. As
a required input to this example system there are needed a
plurality of pairs of clean and infected versions of a
computer file. At step 16 a clean computer file and a
corresponding infected computer file are compared on a
byte-by-byte basis to identify matching and non-matching
portions; the non-matching portions are identified as the
virus carrying portions. Step 16 is repeated for a plurality
(X) of pairs of clean and infected computer files.
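
By way of illustration only, the comparison of step 16 might be sketched as follows. The function name `virus_carrying_portions` is hypothetical, and the sketch swaps in the sequence alignment of Python's standard `difflib` module in place of a literal byte-by-byte walk, so that inserted virus code does not throw every later byte out of step. It is a minimal sketch under those assumptions and not the embodiment's own implementation.

```python
from difflib import SequenceMatcher


def virus_carrying_portions(clean: bytes, infected: bytes) -> list[tuple[int, int]]:
    """Return (start, end) offsets in `infected` that have no counterpart in `clean`.

    Assumption: an alignment-based diff stands in for the byte-by-byte
    comparison described in the text.
    """
    matcher = SequenceMatcher(None, clean, infected, autojunk=False)
    portions = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" mark bytes present only in the infected file.
        if tag in ("insert", "replace"):
            portions.append((j1, j2))
    return portions


# Example pair: two runs of foreign bytes inserted into an otherwise intact file.
clean_file = bytes(range(30))
infected_file = (clean_file[:10] + b"\xf0\xf1\xf2" + clean_file[10:20]
                 + b"\xe0\xe1" + clean_file[20:])
regions = virus_carrying_portions(clean_file, infected_file)
```

Running the function once per clean and infected pair gives the per-file virus carrying portions that the later steps search.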

[0024] At step 18 a search is made across the plurality
of infected computer files (and decrypted regions sub- 
sequent to emulation as will be discussed below) start- 
ing from their entry points looking within their virus car- 
rying portions for matching blocks of bytes. If no match 
is found, then processing proceeds to step 20. If a match 
is found, then processing proceeds to step 22. Step 22 
searches across the plurality of infected files within 
which the matching blocks of bytes have been found to 
identify common pointers to those matching blocks that 
may be used in an automatic detection mechanism as 
part of the fingerprint data. The candidate types of com- 
mon pointer could take a variety of forms, such as look- 
ing for a fixed distance from the start of the file, a fixed 
distance from the end of the file, a fixed distance from 
some other reference point within the file etc. 
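
The search for a common pointer at step 22 can be pictured with the following sketch. It assumes that the offset of the matching block, the file size and the entry point are already known for each infected sample; the dictionary keys and the three candidate pointer types are illustrative choices, not a list taken from the embodiment.

```python
from typing import Optional


def find_common_pointer(samples: list[dict]) -> Optional[tuple[str, int]]:
    """Return (pointer_type, value) if one candidate is identical in all samples.

    Each sample is a dict with the hypothetical keys "block_offset",
    "file_size" and "entry_point" (all measured in bytes from file start).
    """
    candidates = {
        "from_start": lambda s: s["block_offset"],
        "from_end": lambda s: s["file_size"] - s["block_offset"],
        "from_entry": lambda s: s["block_offset"] - s["entry_point"],
    }
    for name, measure in candidates.items():
        values = {measure(s) for s in samples}
        if len(values) == 1:  # same distance in every infected file
            return name, values.pop()
    return None


# Three hosts of different size with the block always 40 bytes before the end.
hosts = [
    {"block_offset": 60, "file_size": 100, "entry_point": 0},
    {"block_offset": 210, "file_size": 250, "entry_point": 0},
    {"block_offset": 360, "file_size": 400, "entry_point": 0},
]
pointer = find_common_pointer(hosts)
```

If no candidate yields a single value across all samples the function returns `None`, which corresponds to the "no common pointer is found" branch.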
[0025] If no common pointer is found, then processing 
again proceeds to step 20. If a common pointer is found, 
then step 24 seeks to identify if the matching block of 
code found across all of the samples of the infected 
computer file with a derivable common pointer to that 
block has a size over a threshold value N. If the matching 
block is too small, then it is not suitable for use in a virus 
detection mechanism. If the matching block is too small 
as determined at step 24, then processing proceeds to 
step 20. If the matching block is over the threshold size 
as determined at step 24, then processing proceeds to 
step 26 at which the fingerprint definition data is gener- 
ated specifying the matching block of bytes that may be 
searched for and the common position that matching 
block of bytes will have within an infected computer file. 
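
One way to picture the block search of step 18 together with the size test of step 24 is the sketch below. The function name, the use of a binary search over candidate lengths and the tie-breaking rule are assumptions made for illustration; the text does not prescribe a particular search algorithm.

```python
from typing import Optional


def longest_common_block(regions: list[bytes], threshold: int) -> Optional[bytes]:
    """Return the longest byte block found in every region, if longer than threshold."""
    if not regions:
        return None
    shortest = min(regions, key=len)

    def common_of_length(n: int) -> Optional[bytes]:
        # Intersect the sets of all length-n blocks, one region at a time.
        found = {shortest[i:i + n] for i in range(len(shortest) - n + 1)}
        for region in regions:
            found &= {region[i:i + n] for i in range(len(region) - n + 1)}
            if not found:
                return None
        return min(found)  # arbitrary but deterministic choice among ties

    # If a common block of length n exists, one of length n - 1 also exists,
    # so the largest feasible length can be located by binary search.
    low, high, best = threshold + 1, len(shortest), None
    while low <= high:
        mid = (low + high) // 2
        block = common_of_length(mid)
        if block is not None:
            best, low = block, mid + 1
        else:
            high = mid - 1
    return best


sample_regions = [b"xxVIRUSBODYyy", b"abcVIRUSBODYq", b"VIRUSBODY12345"]
block = longest_common_block(sample_regions, threshold=4)
```

A result of `None` corresponds to the case in which the matching block is too small and processing proceeds to step 20.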
[0026] Step 20 determines whether or not a threshold 
number of emulation attempts have already been made 
upon the candidate infected computer files. If emulation 
has already been tried over a threshold number of times 
without successfully being able to generate fingerprint 
definition data, then processing proceeds to step 28 at 
which the automatic process fails with appropriate gen- 
eration of an error message and further investigation of 
the infected computer files manually by an expert in 
computer viruses is required. If the emulation count has 
not been exceeded as is identified at step 20, then 
processing proceeds to step 30 at which emulation of 
execution of the plurality of infected computer files is 
performed until for each of the infected computer files 
the execution point reaches a portion of code that was 
written during the emulation. This is indicative of the be- 
haviour of a computer virus that seeks to hide itself by 
encryption which must first decrypt itself before starting 
to execute the decrypted virus code. Once the plurality 
of infected computer files have all had their execution 
emulated up to this point, then processing is returned to
step 18 where a fresh search for matching blocks of
bytes and common pointers may be made.
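
The stopping condition of step 30 can be illustrated with a toy machine. Everything in the sketch below is hypothetical: the three-instruction set, the `ToyMachine` class and the step limit are invented for illustration and bear no relation to a real CPU emulator. The point shown is only the rule "stop when the execution point reaches an address that was written during the emulation".

```python
from dataclasses import dataclass, field
from typing import Optional

OP_HALT, OP_XOR, OP_JMP = 0x00, 0x01, 0x02  # hypothetical toy opcodes


@dataclass
class ToyMachine:
    memory: bytearray
    pc: int = 0
    written: set = field(default_factory=set)
    halted: bool = False

    def step(self) -> None:
        op = self.memory[self.pc]
        if op == OP_XOR:  # XOR addr, key: decrypt one byte in place
            addr, key = self.memory[self.pc + 1], self.memory[self.pc + 2]
            self.memory[addr] ^= key
            self.written.add(addr)  # remember every address written
            self.pc += 3
        elif op == OP_JMP:  # JMP addr
            self.pc = self.memory[self.pc + 1]
        else:
            self.halted = True


def emulate_until_written_code(machine: ToyMachine, max_steps: int = 10_000) -> Optional[int]:
    """Return the address where execution first enters written memory, else None."""
    for _ in range(max_steps):
        if machine.pc in machine.written:
            return machine.pc
        if machine.halted or machine.pc >= len(machine.memory):
            return None
        machine.step()
    return None


# A two-byte body at addresses 8 and 9 is stored XOR-encrypted with key 0x55.
image = bytearray([OP_XOR, 8, 0x55, OP_XOR, 9, 0x55, OP_JMP, 8, 0x55, 0x55])
machine = ToyMachine(image)
stop_address = emulate_until_written_code(machine)
```

When the function returns, the machine's memory holds the decrypted region, which is what the fresh search at step 18 would then be run against.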
[0027] Emulation until written memory is executed is not
easy to implement and its implementation might use significant
system resources. It cannot therefore be considered a basic
emulation operation and should not be included in a removal
sequence as a single step. The preferred embodiment may
contain an algorithm that tries to replace the operation
"emulate until written memory is executed" with one or more
simple basic emulating techniques (examples: "emulate
specified number of CPU instructions", "emulate until
execution path exceeds specified area", "emulate until
particular byte string is found in specified area", etc.).
This algorithm may perform emulation until written memory is
executed for each infected file in the set, at the same time
trying to determine suitable parameters for those basic
emulating techniques that would give the same decrypted areas.
The algorithm chooses the technique which has the most similar
parameters for all infected files in the set and then
determines the best parameters that would produce the expected
decrypted areas for each infected file in the set. The
technique obtained, along with its parameters, is included in
the resulting removal sequence (if emulation appears to be
necessary).
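
The selection between basic emulating techniques might be sketched as below. The text does not say how "most similar" is measured or how the final parameter is derived, so both are assumptions here: similarity is taken as the smallest spread between the largest and smallest observed parameter, and the chosen parameter is the largest observed value. The technique names and figures are invented.

```python
def choose_basic_technique(observations: dict[str, list[int]]) -> tuple[str, int]:
    """Pick the technique whose per-file parameters vary least.

    `observations` maps a technique name to the parameter recorded for each
    infected file while full emulation ran (hypothetical input format).
    """
    # Assumption: "most similar parameters" means the smallest max-min spread.
    spread = {name: max(values) - min(values) for name, values in observations.items()}
    best = min(spread, key=spread.get)
    # Assumption: the largest observed value is safe for every file in the set.
    return best, max(observations[best])


recorded = {
    "emulate_n_instructions": [1200, 1210, 1195],
    "emulate_until_byte_string_at": [40, 900, 77],
}
technique, parameter = choose_basic_technique(recorded)
```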
[0028] Figure 5 is a flow diagram illustrating the
processing performed in seeking to automatically generate
virus removal data. The required input to this example process
is at least two pairs of clean and infected files. At step 32,
a pair comprising a clean file and an infected file is
integrity checked. In this context, the infected file is
checked to see if it is smaller or not significantly larger
than the clean file. If the infected file is smaller or not
significantly larger than the clean file, then it is highly
probable that a large portion of the data contained within the
clean file has been overwritten and this pair of infected and
clean files will not make a good candidate for identifying
automatic virus removal data. Accordingly, if the integrity
check of step 32 is failed, then processing proceeds to step
34 at which another base file pair is selected.
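
The integrity check of step 32 reduces to a size comparison, sketched below. The text gives no figure for "significantly larger", so the `min_growth` parameter and its default of 64 bytes are purely illustrative assumptions.

```python
def passes_integrity_check(clean_size: int, infected_size: int, min_growth: int = 64) -> bool:
    """Accept a pair only if the infected file is significantly larger.

    `min_growth` (bytes) is a hypothetical threshold; the text does not
    state what counts as "significantly larger".
    """
    return infected_size >= clean_size + min_growth


usable = passes_integrity_check(clean_size=4096, infected_size=6144)
overwritten = passes_integrity_check(clean_size=4096, infected_size=4100)
```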

[0029] If the integrity check of step 32 is passed, then 
processing proceeds via step 33 to step 36. Step 33 
compares the clean and infected files to identify virus 
carrying regions. Step 36 serves to search the infected 
regions (or on subsequent passes through the process- 
ing a decrypted region) for the largest matching block of 
bytes corresponding to the clean file. A pointer (or a plu- 
rality of candidate pointers) to that matching region is 
also derived. Step 38 serves to confirm the identity of
such a region for recovery by testing to see if such a
matching region occurs at the same pointed-to location
within all other pairs of a set of infected files and clean
files.
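
The search of step 36 might look like the following sketch, which uses `find_longest_match` from Python's standard `difflib` module to locate the largest run of clean-file bytes inside one virus carrying region. The function name and the returned triple are assumptions, and the pointer is expressed as a plain offset from the start of the infected file for simplicity.

```python
from difflib import SequenceMatcher


def largest_recoverable_block(infected: bytes, region: tuple[int, int],
                              clean: bytes) -> tuple[int, int, int]:
    """Return (infected_offset, clean_offset, size) of the largest matching block."""
    start, end = region
    chunk = infected[start:end]
    matcher = SequenceMatcher(None, chunk, clean, autojunk=False)
    match = matcher.find_longest_match(0, len(chunk), 0, len(clean))
    # match.a indexes into the region, so shift it back to a file offset.
    return start + match.a, match.b, match.size


# The virus overwrote the first six bytes and stashed the originals in its body.
clean_data = b"0123456789ABCDEFGHIJ"
infected_data = b"VVVVVV" + clean_data[6:] + b"vv" + clean_data[:6] + b"vv"
pointer, original_position, size = largest_recoverable_block(
    infected_data, (20, 30), clean_data)
```

Step 38 would then repeat the same lookup on the other pairs and keep the pointer only if it lands on matching data every time.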

[0030] If there are any critical portions of the clean
file that have not been found within the infected file, then
step 40 serves to attempt decryption of the virus carrying
portions of the infected computer file using a set of known,
and relatively simple, decryption techniques in an attempt to
identify the critical portions of the clean computer file
that are still missing.

[0031] After step 40, step 42 serves to confirm the
decryption technique and the pointer to the data to be
decrypted by attempting a recovery of similar critical data
in all further example infected computer files until a
match is found indicating a reliable way of recovering
encrypted critical data.
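
The text does not name the "known, and relatively simple, decryption techniques" of step 40. As an assumed example only, the sketch below tries single-byte XOR and single-byte additive decoding over a virus carrying region and reports the first combination that exposes the missing critical bytes.

```python
from typing import Optional


def find_encrypted_critical(region: bytes, critical: bytes) -> Optional[tuple[str, int, int]]:
    """Return (technique, key, offset) that reveals `critical` inside `region`.

    Assumption: only single-byte XOR and single-byte ADD are tried.
    """
    techniques = (
        ("xor", lambda byte, key: byte ^ key),
        ("add", lambda byte, key: (byte + key) & 0xFF),
    )
    for key in range(1, 256):  # key 0 would be the plain, already-searched bytes
        for name, decode in techniques:
            decoded = bytes(decode(b, key) for b in region)
            offset = decoded.find(critical)
            if offset != -1:
                return name, key, offset
    return None


# A five-byte header hidden inside the virus body under XOR key 0x5A.
header = b"MZ\x90\x00\x03"
virus_region = b"junk" + bytes(b ^ 0x5A for b in header) + b"more"
recovery = find_encrypted_critical(virus_region, header)
```

The returned technique, key and offset are the kind of information that step 42 would then try on the further infected samples.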

[0032] Step 44 determines whether or not there are
still some regions of the clean computer file that have not
been recovered. If all of the clean computer file has been
recovered, then processing proceeds to step 46 at which the
cut points within the infected computer file that identify the
remaining virus carrying portions may be identified such that
a repair technique may use these to select the virus carrying
portions to be removed during the cleaning process. The cut
points may be identified in a variety of different ways, such
as an end-of-file type marker after rearrangement such that
when repairing a file all data after such a marker will be
known to be data to be discarded. Alternatively for some
viruses, the data to be discarded may be concentrated at the
beginning of a file with the file being collapsed with removal
of this data as the best step of a repair operation. At step
48, the virus removal data that has been generated is tested
on all further pairs of a set of infected files and clean
files to check that it does work. If the technique does not
work, then processing proceeds to step 50 which determines
whether or not the process has been tried on two infected base
files already. If generation of removal data has already been
attempted on two candidate base files without success, then
processing proceeds to step 52 at which the automatic process
fails and an appropriate error message is generated such that
the generation of removal data may be referred for manual
investigation by an expert. If two infected base files have
not already been tried, then processing proceeds to step 34.
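
The testing of step 48 can be sketched by applying candidate removal data to each pair and comparing the result with the clean file. The representation used here is an assumption: each entry is a triple of (location pointer, block size, original position), with pointers taken as plain offsets from the start of the infected file, so the sketch only suits samples that share one layout.

```python
def apply_removal_data(infected: bytes, blocks: list[tuple[int, int, int]]) -> bytes:
    """Rebuild a file from (pointer, size, original_position) triples."""
    restored_size = max(position + size for _, size, position in blocks)
    restored = bytearray(restored_size)
    for pointer, size, position in blocks:
        # Copy each recovered block back to where it sat in the clean file;
        # bytes not covered by any block (the virus) are simply left out.
        restored[position:position + size] = infected[pointer:pointer + size]
    return bytes(restored)


def removal_data_works(blocks: list[tuple[int, int, int]],
                       pairs: list[tuple[bytes, bytes]]) -> bool:
    """True if the removal data restores every (clean, infected) pair exactly."""
    return all(apply_removal_data(infected, blocks) == clean
               for clean, infected in pairs)


def infect(clean: bytes) -> bytes:
    # Hypothetical virus layout used only to build test samples.
    return b"VVVVVV" + clean[6:] + b"vv" + clean[:6] + b"vv"


removal_blocks = [(6, 14, 6), (22, 6, 0)]
test_pairs = [(c, infect(c)) for c in (b"0123456789ABCDEFGHIJ", b"abcdefghijklmnopqrst")]
verified = removal_data_works(removal_blocks, test_pairs)
```

A `False` result corresponds to the branch in which processing proceeds to step 50.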

[0033] If the test at step 44 indicated that some
regions of the clean computer file have still not been found,
then processing proceeds to step 54. Step 54 determines
whether in excess of a threshold number of emulation attempts
have been performed upon the infected computer file. If
emulation has been performed in excess of this threshold
level, then processing proceeds to step 46. If emulation has
not yet been tried in excess of the threshold level, then
processing proceeds to step 56 at which emulation of execution
of the infected computer file and any other infected computer
files being used to cross check the data produced is performed
until the execution path within those infected computer files
reaches a written portion. This corresponds to the computer
virus having decrypted a portion of its body or a portion of
the original clean computer file. After step 56, processing
returns to step 36 at which the searching for matching blocks
of bytes and pointers to those matching blocks of bytes as
well as critical portions of the clean file is repeated as
described above but in this case carried out upon the
decrypted regions.
[0034] Figure 6 schematically illustrates a general
purpose computer 200 of the type that may be used to implement
the above techniques. The general purpose computer 200
includes a central processing unit 202, a random access memory
204, a read only memory 206, a hard disk drive 208, a display
driver 210 and display 212, a user input/output circuit 214
and keyboard 216 and mouse 218 and a network interface unit
220 all connected via a common bus 222. In operation the
central processing unit 202 executes program instructions
stored within the random access memory 204, the read only
memory 206 or the hard disk drive 208. The working memory is
provided by the random access memory 204. The program
instructions could take a variety of forms depending on the
precise nature of the computer 200 and the programming
language being used. The results of the processing are
displayed to a user upon the display 212 driven by the display
driver 210. User inputs for controlling the general purpose
computer 200 are received from the keyboard 216 and the mouse
218 via the user input/output circuit 214. Communication with
other computers, such as exchanging e-mails, downloading files
or providing internet or other network access, is achieved via
the network interface unit 220.
[0035] It will be appreciated that the general purpose
computer 200 operating under control of a suitable computer
program may perform the above described techniques and provide
apparatus for performing the various tasks described. The
general purpose computer 200 also executes the method
described previously. The computer program product could take
the form of a recordable medium bearing the computer program,
such as a floppy disk, a compact disk or other recordable
medium. Alternatively, the computer program could be
dynamically downloaded via the network interface unit 220.
[0036] It will be appreciated that the general purpose
computer 200 is only one example of the type of computer
architecture that may be employed to carry out the above
described techniques. Alternative architectures are envisaged
and are capable of use with the above described techniques.
Claims

1. A computer program product including a computer
program for controlling a computer to automatically
generate virus fingerprint data for use in detecting
computer files infected with a computer virus, said
computer program comprising:

comparison logic operable for each of a plurality of
computer files to compare a computer file infected with
said computer virus with a version of said computer file
not infected with said computer virus to identify virus
carrying portions of said computer file infected with
said computer virus that are not present within said
version of said computer file not infected with said
computer virus;

block searching logic operable to search within said
virus carrying portions of said plurality of computer
programs infected with said computer virus for a matching
block of code exceeding a threshold block size;

pointer searching logic operable to search within said
plurality of computer files infected with said computer
virus for a common pointer definition identifying a
position of said matching block of code within each of
said plurality of computer files infected with said
computer virus; and

fingerprint generating logic operable to generate virus
fingerprint data representing said matching block of code
and said common pointer definition such that said
fingerprint data may be applied to a computer file
suspected of being infected with said computer virus to
determine if said common block of code is present at a
position indicated by said common pointer definition
thereby indicating infection by said computer virus.

2. A computer program product as claimed in claim 1,
wherein said searching for a matching block of code
starts following an execution path starting at an entry
point for each computer file infected with said computer
virus.

3. A computer program product as claimed in claims 1 and
2, wherein if one or more of said block searching logic
and said pointer searching logic do not succeed, then
emulation logic operates to emulate execution of each of
said computer programs infected with said computer virus
until execution of a portion of code written during said
emulation is reached whereupon said searching for a
matching block of code and said searching for a common
pointer definition are repeated.

4. A computer program product as claimed in claim 3,
wherein said emulation is repeated up to a threshold
number of emulations, whereupon if one or more of said
matching block of code and said common pointer definition
are not found then said computer program product is
terminated without generating said virus fingerprint data.



5. A computer program product as claimed in claim 3,
wherein said emulation in a removal sequence is
replaced by simplified emulation using one of the
following techniques:

(i) a predetermined number of instructions have
been emulated;

(ii) emulation of an instruction outside of a
predetermined boundary is reached; and

(iii) a predetermined byte string is generated.

6. A computer program product including a computer
program for controlling a computer to automatically
generate virus removal data defining how a computer virus
may be removed from a computer file infected with said
computer virus, said computer program comprising:

virus portion identifying logic operable to identify as
virus carrying portions one or more portions of said
computer file infected with said computer virus not
matching blocks of code within said version of said
computer file not infected with said computer virus;

block searching logic operable to search within said
virus carrying portions for blocks of code matching a
version of said computer file not infected with said
computer virus;

pointer identifying logic operable to identify one or
more location pointers to said matching blocks of code
within said computer file infected with said computer
virus;

testing logic operable to test said one or more location
pointers to check that they identify matching blocks of
code within all of a set of other computer files infected
with said computer virus; and

removal data generating logic operable to generate said
virus removal data representing said one or more location
pointers, block sizes and original positions such that
virus removal data may be applied to a computer file
infected with said computer virus to remove said computer
virus.

7. A computer program product as claimed in claim 6,
wherein for critical portions of said version of said
computer file not infected with said computer virus
that are not present within said block of matching
code, said virus carrying portions are subject to one
or more decryption techniques to seek to identify
said critical portions within said infected portions
such that said virus removal data can specify how
to recover said critical portions.

8. A computer program product as claimed in claims
6 and 7, wherein if matching blocks of code have
not been found within said computer file infected
with said computer virus corresponding to all portions
of said version of said computer file not infected
with said computer virus, then execution of said
computer file infected with said computer virus is
emulated until execution of code written during said
emulation is reached, whereupon a further attempt to
find said matching blocks of code is made.

9. A computer program product as claimed in claim 8, 
wherein said emulation is repeated up to a thresh- 
old number of emulations. 

10. A method of automatically generating virus finger- 
print data for use in detecting computer files infect- 
ed with a computer virus, said method comprising 
the steps of: 

for each of a plurality of computer files, com- 
paring a computer file infected with said com- 
puter virus with a version of said computer file 
not infected with said computer virus to identify 
virus carrying portions of said computer file in- 
fected with said computer virus that are not 
present within said version of said computer file 
not infected with said computer virus; 
searching within said virus carrying portions of 
said plurality of computer programs infected 
with said computer virus for a matching block 
of code exceeding a threshold block size; 
searching within said plurality of computer files 
infected with said computer virus for a common 
pointer definition identifying a position of said 
matching block of code within each of said plu- 
rality of computer files infected with said com- 
puter virus; and 

generating virus fingerprint data representing 
said matching block of code and said common 
pointer definition such that said fingerprint data 
may be applied to a computer file suspected of 
being infected with said computer virus to determine
if said common block of code is present
at a position indicated by said common pointer 
definition thereby indicating infection by said 
computer virus. 
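
The steps of claim 10 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names, the use of `difflib.SequenceMatcher` for the comparison, the greedy common-block search and the two candidate pointer definitions (`from_start`, `from_end`) are all assumptions made for illustration, and the block position is located with a plain `bytes.find`.

```python
from difflib import SequenceMatcher

def virus_portion(clean, infected):
    """Largest span of `infected` that has no counterpart in `clean`."""
    ops = SequenceMatcher(None, clean, infected, autojunk=False).get_opcodes()
    spans = [(j1, j2) for tag, _, _, j1, j2 in ops if tag in ("insert", "replace")]
    if not spans:
        return b""
    start, end = max(spans, key=lambda s: s[1] - s[0])
    return infected[start:end]

def common_block(portions, min_size):
    """Greedy search for a block shared by every virus portion (not guaranteed optimal)."""
    block = portions[0]
    for other in portions[1:]:
        m = SequenceMatcher(None, block, other, autojunk=False).find_longest_match(
            0, len(block), 0, len(other))
        block = block[m.a:m.a + m.size]
    return block if len(block) > min_size else None

def common_pointer(infected_files, block):
    """Return the first hypothetical pointer definition that gives one value for all files."""
    definitions = {
        "from_start": lambda data, pos: pos,
        "from_end": lambda data, pos: pos - len(data),
    }
    for name, fn in definitions.items():
        values = {fn(data, data.find(block)) for data in infected_files}
        if len(values) == 1:
            return name, values.pop()
    return None

def generate_fingerprint(pairs, min_size=8):
    """`pairs` is a list of (clean, infected) byte strings for the same virus."""
    block = common_block([virus_portion(c, i) for c, i in pairs], min_size)
    if block is None:
        return None
    pointer = common_pointer([i for _, i in pairs], block)
    if pointer is None:
        return None
    return {"block": block, "pointer": pointer}

def is_infected(fingerprint, data):
    """Apply the fingerprint: is the block present at the pointed-to position?"""
    name, value = fingerprint["pointer"]
    pos = value if name == "from_start" else len(data) + value
    block = fingerprint["block"]
    return pos >= 0 and data[pos:pos + len(block)] == block
```

For a virus that appends itself to hosts of different lengths, the offset from the start of the file differs per host while the offset from the end is constant, so the sketch settles on the `from_end` definition.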

11. A method as claimed in claim 10, wherein said 
searching for a matching block of code starts follow- 
ing an execution path starting at an entry point for 
each computer file infected with said computer vi- 
rus. 

12. A method as claimed in claims 10 and 11, wherein 
if one or more of said step of searching for a match- 
ing block of code and said step of searching for a 
common pointer definition does not succeed, then 
emulating execution of each of said computer pro- 
grams infected with said computer virus until exe- 
cution of a portion of code written during said emu- 
lation is reached whereupon said searching for a 
matching block of code and said searching for a 
common pointer definition are repeated. 

13. A method as claimed in claim 12, wherein said step
of emulating is repeated up to a threshold number
of emulations, whereupon if one or more of said
matching block of code and said common pointer 
definition are not found then said method is termi- 
nated without generating said virus fingerprint data. 
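
The retry behaviour of claims 12 and 13 can be sketched as a bounded loop. This is an illustrative sketch only: `search` and `emulate_layer` are hypothetical callbacks, and the toy "emulator" below merely subtracts one from every byte to stand in for running a decryptor until it reaches code it has just written.

```python
from typing import Callable, List, Optional

def search_with_emulation(
    images: List[bytes],
    search: Callable[[List[bytes]], Optional[dict]],
    emulate_layer: Callable[[bytes], bytes],
    max_emulations: int = 3,
) -> Optional[dict]:
    """Search, and on failure emulate one layer and search again, up to a threshold."""
    for emulations_done in range(max_emulations + 1):
        result = search(images)
        if result is not None:
            result["emulations"] = emulations_done
            return result
        if emulations_done < max_emulations:
            # Replace each image with its state after one more emulation pass.
            images = [emulate_layer(img) for img in images]
    # Threshold reached without a match: terminate without fingerprint data.
    return None

# Toy stand-ins (assumptions, not part of the source text).
MARKER = b"PAYLOAD"

def toy_search(images):
    """Pretend block/pointer search: succeeds once MARKER is visible in every image."""
    return {"block": MARKER} if all(MARKER in img for img in images) else None

def toy_layer(image):
    """Pretend emulation pass: undo one layer of a byte-wise 'add 1' encoding."""
    return bytes((b - 1) & 0xFF for b in image)
```

With images encoded under two such layers, the search succeeds after two emulation passes when the threshold allows it, and returns `None` when the threshold is set to one.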

14. A method as claimed in claim 12, wherein said step 
of emulating in a removal sequence is replaced by 
simplified emulation using one of the following tech- 
niques: 
(i) a predetermined number of instructions have
been emulated;

(ii) emulation of an instruction outside of a predetermined
boundary is reached; and

(iii) a predetermined byte string is generated.
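
One possible reading of the three techniques listed in claim 14 is as stop conditions for a simplified emulation loop. The sketch below assumes that reading. The `state` dictionary, the `step` callback and the toy one-byte-per-instruction decryptor are hypothetical and only serve to make the example runnable.

```python
def run_simplified(step, state, max_instructions, boundary, stop_bytes):
    """Run `step` until one of the three stop conditions holds; return which one."""
    low, high = boundary
    for _ in range(max_instructions):
        # (ii) the next instruction lies outside the predetermined boundary
        if not (low <= state["pc"] < high):
            return "out_of_bounds"
        step(state)
        # (iii) the predetermined byte string has been generated in memory
        if stop_bytes in state["memory"]:
            return "byte_string"
    # (i) the predetermined number of instructions has been emulated
    return "instruction_limit"

KEY = 0x21  # toy XOR key (assumption)

def toy_step(state):
    """Toy instruction: decrypt the byte at `pc`, then advance `pc` by one."""
    state["memory"][state["pc"]] ^= KEY
    state["pc"] += 1

def fresh_state(plaintext):
    """Build a toy machine whose memory holds `plaintext` XOR-encrypted with KEY."""
    return {"pc": 0, "memory": bytearray(b ^ KEY for b in plaintext)}
```

In this toy machine the same encrypted image stops for a different reason depending on which limit is hit first.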

15. A method of automatically generating virus removal
data defining how a computer virus may be removed
from a computer file infected with said computer
virus, said method comprising the steps of:

identifying as virus carrying portions one or
more portions of said computer file infected with
said computer virus not matching blocks of
code within said version of said computer file
not infected with said computer virus;
searching within said virus carrying portions for
blocks of code matching a version of said computer
file not infected with said computer virus;
identifying one or more location pointers to said
matching blocks of code within said computer
file infected with said computer virus;
testing said one or more location pointers to
check that they identify matching blocks of code
within all of a set of other computer files infected
with said computer virus; and
generating said virus removal data representing
said one or more location pointers, block
sizes and original positions such that virus removal
data may be applied to a computer file
infected with said computer virus to remove
said computer virus.
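
The steps of claim 15 can be sketched in Python as follows. This is an assumption-laden illustration, not the claimed implementation: the record layout (pointer measured from the end of the file, block size, original position), the single trailing cut point, and the toy infector used to exercise it are all hypothetical.

```python
from difflib import SequenceMatcher

def build_removal_data(clean, infected, min_size=4):
    """Derive restore records and a cut point from one clean/infected pair."""
    ops = SequenceMatcher(None, clean, infected, autojunk=False).get_opcodes()
    # Spans of the infected file that do not match the clean file in place.
    virus = [(j1, j2) for tag, _, _, j1, j2 in ops if tag != "equal" and j2 > j1]
    # Spans of the clean file that are missing from their original position.
    missing = [(i1, i2) for tag, i1, i2, _, _ in ops
               if tag != "equal" and i2 - i1 >= min_size]
    restores = []
    for i1, i2 in missing:
        needle = clean[i1:i2]
        for j1, j2 in virus:
            pos = infected.find(needle, j1, j2)
            if pos != -1:
                # (location pointer from end of file, block size, original position)
                restores.append((pos - len(infected), i2 - i1, i1))
                break
    # Assumption: a virus span that runs to the end of the file can be truncated.
    cut = 0
    if virus and virus[-1][1] == len(infected):
        cut = virus[-1][0] - len(infected)
    return {"restores": restores, "cut": cut}

def remove_virus(data, removal):
    """Apply removal data: truncate at the cut point, then copy blocks back."""
    out = bytearray(data[:len(data) + removal["cut"]])
    for pointer, size, original in removal["restores"]:
        src = len(data) + pointer
        out[original:original + size] = data[src:src + size]
    return bytes(out)

def verified_removal_data(pairs):
    """Keep the removal data only if it is identical for, and repairs, every pair."""
    candidates = [build_removal_data(c, i) for c, i in pairs]
    first = candidates[0]
    if any(c != first for c in candidates[1:]):
        return None
    if any(remove_virus(i, first) != c for c, i in pairs):
        return None
    return first
```

The sketch can be exercised with a toy infector that overwrites the first four bytes of a host and stores the originals after its own body; the derived record then points four bytes back from the end of the file and restores them to position 0.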

16. A method as claimed in claim 15, wherein for critical
portions of said version of said computer file not infected
with said computer virus that are not present
within said block of matching code, said virus carrying
portions are subject to one or more decryption
techniques to seek to identify said critical portions
within said infected portions such that said virus removal
data can specify how to recover said critical
portions.
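
The claim does not name the decryption techniques. As a hedged illustration, the sketch below assumes two simple candidates, a single-byte XOR and a single-byte additive transform, and tries every key until the critical portion appears in the decoded virus carrying portion. The function name and the returned record are hypothetical.

```python
def find_encrypted(critical, virus_portion):
    """Try simple byte-wise transforms; report the first that reveals `critical`."""
    transforms = {
        "xor": lambda b, k: b ^ k,
        "add": lambda b, k: (b + k) & 0xFF,
    }
    for name, fn in transforms.items():
        for key in range(256):
            decoded = bytes(fn(b, key) for b in virus_portion)
            pos = decoded.find(critical)
            if pos != -1:
                # Enough for removal data to say how to recover the bytes.
                return {"method": name, "key": key, "offset": pos}
    return None
```

A key of 0 under either transform corresponds to the critical portion being stored unencrypted, so the plain search is covered by the same loop.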

17. A method as claimed in claims 15 and 16, wherein
if matching blocks of code have not been found
within said computer file infected with said computer
virus corresponding to all portions of said version
of said computer file not infected with said computer
virus, then execution of said computer file infected
with said computer virus is emulated until execution
of code written during said emulation is reached,
whereupon a further attempt to find said matching
blocks of code is made.

18. A method as claimed in claim 17, wherein said step 
of emulating is repeated up to a threshold number 
of emulations. 

19. Apparatus for automatically generating virus finger- 
print data for use in detecting computer files infect- 
ed with a computer virus, said apparatus compris- 
ing: 

a comparator operable for each of a plurality of
computer files to compare a computer file in- 
fected with said computer virus with a version 
of said computer file not infected with said com- 
puter virus to identify virus carrying portions of 
said computer file infected with said computer 
virus that are not present within said version of 
said computer file not infected with said com- 
puter virus; 

a block searcher operable to search within said 
virus carrying portions of said plurality of com- 
puter programs infected with said computer vi- 
rus for a matching block of code exceeding a 
threshold block size; 

a pointer searcher operable to search within 
said plurality of computer files infected with said 
computer virus for a common pointer definition 
identifying a position of said matching block of 
code within each of said plurality of computer 
files infected with said computer virus; and 
a fingerprint generator operable to generate virus
fingerprint data representing said matching
block of code and said common pointer definition
such that said fingerprint data may be applied
to a computer file suspected of being infected
with said computer virus to determine if
said common block of code is present at a position
indicated by said common pointer definition
thereby indicating infection by said computer
virus.

20. Apparatus as claimed in claim 19, wherein said 
searching for a matching block of code starts follow- 
ing an execution path starting at an entry point for 
each computer file infected with said computer vi- 
rus. 

21. Apparatus as claimed in claims 19 and 20, wherein 
if one or more of said block searching logic and said 
pointer searching logic do not succeed, then an em- 
ulator operates to emulate execution of each of said 
computer programs infected with said computer vi- 
rus until execution of a portion of code written during 
said emulation is reached whereupon said searching
for a matching block of code and said searching
for a common pointer definition are repeated. 

22. Apparatus as claimed in claim 21, wherein said emulation
is repeated up to a threshold number of emulations,
whereupon if one or more of said matching
block of code and said common pointer definition
are not found then said computer program product
is terminated without generating said virus fingerprint
data.

23. Apparatus for automatically generating virus removal
data defining how a computer virus may be
removed from a computer file infected with said
computer virus, said apparatus comprising:

a virus portion identifier operable to identify as
virus carrying portions one or more portions of
said computer file infected with said computer
virus not matching blocks of code within said
version of said computer file not infected with
said computer virus;

a block searcher operable to search within said
virus carrying portions for blocks of code
matching a version of said computer file not infected
with said computer virus;
a pointer identifier operable to identify one or
more location pointers to said matching blocks
of code within said computer file infected with
said computer virus;

a tester operable to test said one or more location
pointers to check that they identify matching
blocks of code within all of a set of computer
files infected with said computer virus; and
a removal data generator operable to generate
said virus removal data representing said one
or more location pointers, block sizes and original
positions such that virus removal data may
be applied to a computer file infected with said
computer virus to remove said computer virus.

24. Apparatus as claimed in claim 23, wherein for critical
portions of said version of said computer file not
infected with said computer virus that are not
present within said block of matching code, said virus
carrying portions are subject to one or more decryption
techniques to seek to identify said critical
portions within said infected portions such that said
virus removal data can specify how to recover said
critical portions.

25. Apparatus as claimed in claims 23 and 24, wherein
if matching blocks of code have not been found
within said computer file infected with said computer
virus corresponding to all portions of said version
of said computer file not infected with said computer
virus, then execution of said computer file infected
with said computer virus is emulated until execution
of a portion of code written during said emulation is
reached whereupon said searching for a matching
block of code and said searching for a common
pointer definition are repeated.

26. Apparatus as claimed in claim 25, wherein said emulation
is repeated up to a threshold number of emulations.

27. Apparatus as claimed in claim 25, wherein said emulation
in a removal sequence is replaced by simplified
emulation using one of the following techniques:

(i) a predetermined number of instructions have
been emulated;

(ii) emulation of an instruction outside of a predetermined
boundary is reached; and

(iii) a predetermined byte string is generated.